Improve PDF reader and SaveAsPdf compatibility#1755
Improve PDF reader and SaveAsPdf compatibility#1755PrzemyslawKlys merged 14 commits intoEvotecIT:masterfrom
Conversation
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: c4c7387578
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (decodeParmsObj is PdfArray decodeParmsArray && | ||
| filterIndex >= 0 && | ||
| filterIndex < decodeParmsArray.Items.Count && | ||
| ResolveDictionary(decodeParmsArray.Items[filterIndex], objects) is PdfDictionary indexedDict) { |
There was a problem hiding this comment.
Resolve DecodeParms refs before indexing filter params
GetDecodeParms only handles DecodeParms when it is a direct dictionary or a direct array. If a PDF uses an indirect array (for example /DecodeParms 12 0 R where object 12 is [null << /Predictor ... >>]), this branch is skipped, so predictor settings are never applied and Flate-decoded content remains PNG-predicted garbage. That causes text extraction to fail on otherwise valid filtered streams that store decode params indirectly.
Useful? React with 👍 / 👎.
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #1755 +/- ##
==========================================
+ Coverage 69.68% 69.94% +0.25%
==========================================
Files 983 993 +10
Lines 115229 117591 +2362
Branches 23598 24288 +690
==========================================
+ Hits 80300 82246 +1946
- Misses 24839 24962 +123
- Partials 10090 10383 +293 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 88fdde728a
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| foreach (var kid in kids.Items) { | ||
| var child = ResolveDict(kid, objects); | ||
| if (child is not null) { | ||
| TraversePagesNode(child, objects, result, visited); | ||
| } |
There was a problem hiding this comment.
Guard page-tree traversal against cyclic Kids references
TraversePagesNode recursively descends through every /Kids entry but the visited set is only used for leaf /Page objects, not intermediate /Pages nodes. If a malformed or malicious PDF has a /Kids cycle (child points back to an ancestor), ExtractAllText will recurse indefinitely and can stack-overflow the process. Add a visited check (or depth guard) for each traversed node before recursing into children.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: f93de88543
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| return idx >= 0 ? idx + 6 : -1; | ||
| int searchFrom = start; | ||
| while (searchFrom >= 0 && searchFrom < text.Length) { | ||
| int streamIdx = IndexOfKeyword(text, "stream", searchFrom, text.Length); |
There was a problem hiding this comment.
Limit stream detection to real stream tokens
FindObjectEnd now treats the first occurrence of the substring stream as a stream start even inside ordinary object contents (for example, a metadata string like (mainstream)), then skips ahead to the next endstream in later objects. That can make object boundaries jump across objects and cause dictionaries to be misparsed as streams, corrupting the object map and breaking metadata/page parsing on valid PDFs containing the word "stream" in non-stream objects.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: a30e38a337
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
|
|
||
| char c = text[idx]; | ||
| return char.IsWhiteSpace(c) || c is '/' or '<' or '>' or '[' or ']' or '(' or ')' or '%'; |
There was a problem hiding this comment.
Exclude name/string delimiters from stream keyword boundaries
FindObjectEnd relies on IndexOfKeyword("stream", ...), but HasKeywordBoundary currently treats /, (, and ) as valid token boundaries, so text like /stream (name value) or (stream) (literal string) inside a non-stream object is still recognized as a real stream marker. In that case the parser jumps ahead looking for endstream, which can shift object boundaries across later objects and corrupt the parsed object map for otherwise valid PDFs containing the standalone word stream in metadata or names.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2a8123a966
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| } | ||
| return $"ref:{r.ObjectNumber}"; | ||
| } | ||
| if (c is PdfStream) return "stream:direct"; |
There was a problem hiding this comment.
Preserve unique keys for direct page content streams
ContentsKey now returns the constant "stream:direct" for every page whose /Contents is a direct stream, but TraversePagesNodeDeepLimited uses this key for de-duplication via contentKeys.Add(key). That means valid PDFs with multiple pages using direct (non-referenced) content streams will keep only the first page and silently drop the rest from Pages. Use a per-page unique key (or disable content-key dedupe for direct streams) so distinct pages are not collapsed.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 7d9765d51d
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| if (_objects.TryGetValue(r.ObjectNumber, out var ind) && ind.Value is PdfArray referencedArray) { | ||
| return ContentsArrayKey(referencedArray); | ||
| } |
There was a problem hiding this comment.
Preserve referenced content-array identity in page dedupe
When /Contents is an indirect reference to a PdfArray, this code now converts the key from ref:<objectNumber> to ContentsArrayKey(...), and TraversePagesNodeDeepLimited uses that key to drop “duplicates.” In valid PDFs, different page objects can point to different content-array objects that contain the same stream references (or reordered wrappers), and this change causes later pages to be silently skipped, reducing Pages and extracted text. Keep the indirect object identity in the dedupe key (or stop content-based dedupe) so distinct pages are not collapsed.
Useful? React with 👍 / 👎.
There was a problem hiding this comment.
💡 Codex Review
Here are some automated review suggestions for this pull request.
Reviewed commit: 2fc7fa534f
ℹ️ About Codex in GitHub
Your team has set up Codex to review pull requests in this repo. Reviews are triggered when you
- Open a pull request for review
- Mark a draft as ready
- Comment "@codex review".
If Codex has suggestions, it will comment; otherwise it will react with 👍.
Codex can also answer questions or update the PR. Try commenting "@codex address that feedback".
| bool trackRecursion = formObjectNumber > 0; | ||
| if (trackRecursion && !activeForms.Add(formObjectNumber)) { | ||
| continue; |
There was a problem hiding this comment.
Guard direct form XObjects against recursive Do cycles
The recursion guard only tracks forms when formObjectNumber > 0, so direct /XObject form streams (returned with objectNumber = 0) are never added to activeForms. A PDF that uses direct form dictionaries with self-reference or a direct-form cycle will recurse indefinitely in CollectTextAndForms, eventually stack-overflowing during GetTextSpans(). Please key recursion tracking for direct streams too (for example by object identity or resource name path), not only indirect object numbers.
Useful? React with 👍 / 👎.
| bool trackRecursion = formObjectNumber > 0; | ||
| if (trackRecursion && !activeForms.Add(formObjectNumber)) { | ||
| return string.Empty; |
There was a problem hiding this comment.
Prevent infinite recursion for direct forms in text extractor
The lightweight extractor has the same guard gap: it only adds forms to activeForms when an indirect object number exists, but TryGetFormStream sets direct form streams to objectNumber = 0. If a page/form resources tree contains direct form XObjects that invoke themselves (or each other), ExtractTextFromContentStream will recurse without termination and can crash with stack overflow. Track direct-form recursion as well, not just indirect references.
Useful? React with 👍 / 👎.
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
|
You have reached your Codex usage limits for code reviews. You can see your limits in the Codex usage dashboard. |
Summary
Root cause
The PDF projects handled the library's own output well, but several external-PDF shapes and a few SaveAsPdf failure paths still relied on narrow assumptions around page inheritance, stream filters, string parsing, and cleanup behavior.
Validation
dotnet test OfficeIMO.Tests/OfficeIMO.Tests.csproj --filter "FullyQualifiedName~PdfReaderAndFooterRegressionTests"dotnet test OfficeIMO.Tests/OfficeIMO.Tests.csproj --filter "FullyQualifiedName~Pdf"Regression results were green on
net8.0,net10.0, andnet472.